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Abstract 

Wc describe a method for inferring linear causal relations among multi- 
dimensional variables. The idea is to use an asymmetry between the dis- 
tributions of cause and effect that occurs if both the covariance matrix 
of the cause and the structure matrix mapping cause to the effect are 
independently chosen. The method works for both stochastic and deter- 
ministic causal relations, provided that the dimensionality is sufficiently 
high (in some experiments, 5 was enough). It is applicable to Gaussian 
as well as non-Gaussian data. 

1 Motivation 

Inferring the causal relations that have generated statistical dependencies among 
a set of observed random variables is challenging if no controlled randomized 
studies can be made. Here, causal relations are represented as arrows connecting 
the variables, and the structure to be inferred is a directed acyclic graph (DAG) 
[1, 2]. The constraint-based approach to causal discovery, one of the best known 
methods, selects directed acyclic graphs that satisfy both the causal Markov con- 
dition and faithfulness: One accepts only those causal hypotheses that explain 
the observed dependencies and demand that all the observed independencies 
are imposed by the structure, i.e., common to all distributions that can be gen- 
erated by the respective causal DAG. However, the methods are fundamentally 
unable to distinguish between DAGs that induce the same set of dependencies 
(Markov-equivalent graphs). Moreover, causal faithfulness is known to be vio- 
lated if some of the causal relations are deterministic [3]. Solving these problem 



requires reasonable prior assumptions, either implicitly or explicitly as priors on 
conditional probabilities, as in Bayesian settings [4]. However, the fact that de- 
terministic dependencies exist in real-world settings shows that priors that are 
densities on the parameters of the Bayesian networks, as it is usually assumed, 
are problematic, and the construction of good priors becomes difficult. 

Recently, several methods have been proposed that are able to distinguish be- 
tween Markov-equivalent DAGs"'^. Linear causal relations among non-Gaussian 
random variables can be inferred via independent-component-analysis (ICA) 
methods [6, 7]. The method of [8] is able to infer causal directions among real- 
valued variables if every effect is a (possibly non-linear) function of its causes 
up to an additive noise term that is independent of the causes. The work of 
[9] augmented these models by applying a non-linear function after adding the 
noise term. If the noise term vanishes or if all of the variables are Gaussian 
and the relation is linear, all these methods fail. Moreover, if the data are high- 
dimensional, the non-linear regression involved in the methods becomes hard to 
estimate. 

Here we present a method that also works for these cases provided that the 
variables are multi-dimensional with sufficiently anisotropic covariance matrices. 
The underlying idea is that the causal hypothesis X ^ Y is only acceptable if 
the shortest description of the joint distribution P{X, Y) is given by separate 
descriptions of the input distribution P{X) and the conditional distribution 
P{Y\X) [10], expressing the fact that they represent independent mechanisms 
of nature. [11] shows toy examples where such an independent choice often leads 
to joint distributions where P{Y) and P{X\Y) satisfy non-generic relations 
indicating that Y ^ X is wrong. Here we develop this idea for the case of 
multi-dimensional variables X and Y with a linear causal relation. 

We start with a motivating example. Assume that X is a multivariate Gaus- 
sian variable with values in R" and the isotropic covariance matrix Cxx = I- 
Let Y be another M"-valued variable that is deterministically influenced by X 
via the linear relation Y = AX for some n x n-matrix A. This induces the 
covariance matrix 

Cyy = AC'xxA^ = AA^. 

The converse causal hypothesis Y ^ X becomes unlikely because P{Y) (which 
is determined by the covariance matrix AA^) and P{X\Y) (which is given by 
X = A~^Y with probability 1) are related in a suspicious way, since the same 
matrix A appears in both descriptions. This untypical relationship between 
P{Y) and P{X\Y) can also be considered from the point of view of symmetries: 
consider the set of covariance matrices UCyyU^ with U € 0{n), where 0{n) 
denotes the orthogonal group. Among them, Cyy is special because it is the 
only one that is transformed into the isotropic covariance matrix Cxx- More 
generally speaking, in light of the fact of how anisotropic the matrices 

Cxx ■■= A-^UCyyU^A-'^ 

^In particular, the elementary problem "infer whether X eauses y or y causes X" has been 
part of the challenge at the NIPS 2008 workshop "Causality; Objectives and Assessment" [5] 
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are for generic U, the hypothetical effect variable is surprisingly isotropic for 
U = 1 (here we have used the short notation A~'^ := (A^^)^). We will show 
below that this remains true with high probability (in high dimensions) if we 
start with an arbitrary covariance matrix Cxx and apply a random linear trans- 
formation A chosen independently of Cxx- 

To understand why independent choices of Cxx and A typically induce un- 
typical relations between A~^ and Cyy we also discuss the simple case that 
Cxx and A are simultaneously diagonal with cj and aj as corresponding diago- 
nal entries. Thus Cyy is also diagonal and its diagonal entries (eigenvalues) are 
cijCj. We now assume that "nature has chosen" the values Cj with j = 1, . . . , n 
independently from some distribution and aj from some other distribution. We 
can then interpret the values Cj as instances of n-fold sampling of the random 
variable c with expectation E(c) and the same for aj. If we assume that a and 
c are independent, we have 

E(a2c) = E(a2)E(c) . (1) 

Due to the law of large numbers, this equation will for large n approximatively 
be satisfied by the empirical averages, i.e., 

For the backward direction Y ^ X we observe that the diagonal entries Cj = 
Cja'j of Cyy and the diagonal entries aj = aj^ of A := A~^ have not been 
chosen independently because 

Eia^c) = E(c) , 

whereas 

E{a^)E{d) = E{a-^)E{a^c) = E{a-^)E{a^)E{c) > E(c) . 

The last inequality holds because the random variables and are always 
negatively correlated (this follows easily from the Cauchy-Schwarz inequality 
E(a^)E(a~^) < 1) except for the trivial case when they are constant. We thus 
observe a systematic violation of (1) in the backward direction. The proof for 
non-diagonal matrices in Section 2 uses standard spectral theory, but is based 
upon the same idea. 

The paper is structured as follows. In Section 2, we define an expression 
with traces on covariance matrices and show that typical linear models induce 
backward models for which this expression attains values that would be untyp- 
ical for the forward direction. In Section 3 we describe an algorithm that is 
based upon this result and discuss experiments with simulated and real data. 
Section 4 proposes possible generalizations. 
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2 Identifiability results 



Given a hypothetical causal model Y = AX + E (where X and Y are n- and m- 
dimensional, respectively) we want to check whether the pair [Cxx^A) satisfies 
some relation that typical pairs {U CxxU^, A) only satisfy with low probability 
if U G 0{n) is randomly chosen. To this end, we introduce the renormalized 
trace 

T„(.) := tr(.)/n 
for dimension n and compare the values 

TrniACxxA^) and r„(Cxjf )t,„(AA^) . (3) 

One shows easily that the expectation of both values coincide if Cxx is randomly 
drawn from a distribution that is invariant under transformations 

Cxx ^ UCxxU^- 

This is because averaging the matrices UCxxU'^ over all U G 0{n) projects 
onto Tn{Cxx)^ since the average UCxxU^ commutes with all matrices and 
is therefore a multiple of the identity. For our purposes, it is decisive that 
the typical case is close to this average, i.e., the two expressions in (3) almost 
coincide. To show this, we need the following result [13] : 

Lemma 1 (Levy's Lemma) 

Let g : Sn ^ ^ be a Lipschitz continuous function on the n- dimensional sphere 
with 

L:^maxM7)^. 
iT^Y 111 -7 II 

// a point 7 on Sn is randomly chosen according to an 0{n) -invariant prior, it 
satisfies 

\g{j) - 5l < e 

with probability at least 1 — exp(— K(n — l)e^/L^) for some constant k, where g 
can be interpreted as the median or the average of g{-f). 

Given the above Lemma, we can prove the following Theorem: 

Theorem 1 (traces are typically multiplicative) 

Let C be a symmetric, positive definite n x n-matrix and A an arbitrary mxn- 
matrix. Let U be randomly chosen from 0{n) according to the unique 0{n)- 
invariant distribution (i.e. the Haar measure). Introducing the operator norm 

\\B\\ := max ||Ba;||, 

||x||=l" 

UUG- hOiVG- 

\Tm{AUCU^A'^)-Tn{C)TmiAA'')\ < 2e||C||m^^|| 

with probability at least q := 1 — exp(— K;(n — l)e^) for some constant k (inde- 
pendent of C, A,n,m,e). 
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Proof: for an arbitrary orthonormal system (V'j)j=i,...,m we have 

Trai.AUCU'^A^) = — ^{'4>j,AUCU'^A^iPj) . 

We define the unit vectors 

Dropping the index j, we introduce the function 

/(7) := (7,C'7)- 

For a randomly chosen U e 0(n), 7 is a randomly chosen unit vector according 
to a uniform prior on the n-dimensional sphere S'„ . 

The average of / is given by /= r„(C). The Lipschitz constant is given by 
the operator norm of C, i.e., L = 2||C|| . An arbitrarily chosen j satisfies 

l(7i,C7i)-T„(C)|<2e||q| 

with probability 1— exp(— K(n— l)e^). This follows from Lemma 1 after replacing 
e with eL. Hence 

\{i,j,AUCU^A'^i,j) - T„(C)(V,-,^^^^,-)| < 2e||C7||||AA^|| . 

Due to 

^ m 

TmiAUCU^A^) = - ^(V,-,AA^^,-)(7,-,C7,) , 

we thus have 

ItUAUCU^A^^) - Tm{AA'^)Tn{C)\ < 2e||C|| . 

□ 

It is convenient to introduce 

A{C,A) := logT^iACA^) - logT„(C) - logT„(AA^) 

as a scale-invaxiant measure for the strength of the violation of the equality of 
the expressions (3). 

We now restrict the attention to two special cases where we can show that 
A is non-zero for the backward direction. First, we restrict the attention to 
deterministic models 

Y = AX 

and the case that m> n where A has rank n. This ensures that the backward 
model is also deterministic, i.e., 

X = A-^Y, 

with (.)~^ denoting the pseudo inverse. 

The following theorem shows that A{Cxx,A) = implies A{Cyy,A~^) < 

0: 



5 



Theorem 2 (systematic violation of trace multiplicativity) 

Let n and m denote the dimensions of X and Y , respectively. IfY = AX and 
X = A~^Y, the covariance matrices satisfy 

A{Cxx,A) + A{Cyy, = - log (1 - Cov{Z, 1/Z)) + log - , (4) 

m 

where Z is a real-valued random variable whose distribution is the empirical 
distribution of eigenvalues of AA^ , i.e., Tra{{AA^)'^) — E(Z'') for all A: G N. 

Proof: We have 

Tn{C) 1 Tn{C)T,n{A'' A) 

Tm{ACAT)T^{A-^A-T) t^{AAT)t,,{A-'A-t) t^{ACAT) ' 

Using 

Tn{A-'A-^) = Tn{A-^A-^) = r„((A^^)-^) = ^r„((AA^)-i) 

and taking the logarithm we obtain 

AiACA-, A-^) = log log ^ - AiC A) . 

Then the statement follows from 

Cov{Z, l/Z) = 1 - E(Z)E(1/Z) . 

□ 

Note that the term — log(l — Cov(Z, l/Z)) in eq. (4) will not converge to 
zero for dimension to infinity if the random matrices A are drawn in a way that 
ensures that the distribution of Z converges to some distribution on M with 
non-zero variance. Assuming this, A{Cyy,A~^) tends to some negative value 
if A{Cxx, A) tends to zero for n = m — > oo. 

Wc should, however, mention a problem that occurs for m > n in the noise- 
less case discussed here: Since Cyy has only rank n, we could equally well 
replace A~^ with some other matrix A that coincides with A~^ on all of the 
observed y-valucs. For those matrices A, the value A can get closer to zero 
because the term log n/m expresses the fact that the image of Cyy is orthogonal 
to the kernel of A~^, which is already untypical for a generic model. 

It turns out that the observed violation of the multiplicativity of traces can 
be interpreted in terms of relative entropy distances. To show this, we need the 
following result: 

Lemma 2 (relative entropy in terms of determinants and traces) 

Let C be the covariance matrix of a centralized non- degenerate multi-variate 
Gaussian distribution Pc in n dimensions. Let the anisotropy of C be defined 
by the relative entropy distance to the closest isotropic Gaussian 

D{C):= min D{Pc\\Q) . 
Q isotropic 
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Then 

D{C) = \ (nlogT„(C) - logdet(C7)) . (6) 

Proof: the relative entropy distance of two centralized Gaussians with covariance 
matrices C, Co in n dimensions is given by 

I>(Pd|Pc.) = i(l0g(^^)+.r(C.-.O)-„). 

Setting Co = AI, the distance is minimized for A = r„(C), which yields eq. (6). 

□ 

Straightforward computations show: 

Theorem 3 (multiplicativity of traces cind relative entropy) 

Let C and A be n x n-matrices with C positive definite. Then 

DiACA^) = D{C) + D{AA^) + A(C, A) . 

Hence, for independently chosen A and C, the anisotropy of the output 
covariance matrix ACA^ is approximately given by the anisotropy of C plus 
the anisotropy of AA'^ , which is the anisotropy of the output that A induces on 
an isotropic input. For the backward direction, the anisotropy is smaller than 
the typical value. 

We now discuss an example with a stochastic relation between X and Y. 
We first consider the general linear model 

Y = AX + E, 

where ^ is an n x m matrix and is a noise term (statistically independent of 
X) with covariance matrix Cee- We obtain 

Cyy = ACxxA^ + Cee ■ 

The corresponding backward modeP reads 

X = AY + E. 

with 

A := CxyCyy ■ 

Now wc focus on the special case where A is an orthogonal transformation and 
E is isotropic, i.e., Cee = AI with A > 0. We then obtain a case where Cyy 
and A are related in a way that makes A{Cyy, A) positive: 

^For non-Gaussian X, E, this induces a joint distribution P(X, Y) that does not admit a 
linear backward model with an independent noise E, we can then only obtain uncorrelated 
noise. We could in principle already use this fact for causal inference [6]. However, our 
method also works for the Gaussian case and if the dimension is too high for testing higher- 
order statistical dependences reliably. 
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Lemma 3 (violation of multiplicativity of traces for a special noisy Ccise) 

Let Y = AX + E with A e 0{n) and the covariance matrix of E he given by 
Cee = AI. Then we have 

A(Cry,A) > 0. 

Proof: We have 

Cyy = ACA^ + AI and Cyx = AC , 

with C := Cxx- Therefore, 

A = CA^{ACA^ + AI)-i = C{C + AI)-^^^. 

One checks easily that the orthogonal transformation A is irrelevant for the 
traces and we thus have 

i^ 1 t{C\C + \1)-^) E(ZV(Z+A)) 

'~ ^^t{C + \1)t{C^{C+\1)-^) " ^^E(Z + A)E(Z2/(Z + A)2) ' 

where Z is a random variable of which distribution reflects the distribution of 
eigenvalues of C. The function z ^ z/{z + \) is monotonously increasing for 
positive A and z and thus also z i— > z"^ / {z + A)^. Hence Z + \ and Z'^ / {Z + A)^ 
are positively correlated, i.e., 

E{Z^/{Z + A)) = E((Z + \)Z^/{Z + Xf) > E{Z + A)E {Z^/{Z + Xf) , 

for all distributions of Z with non-zero variance. Hence the logarithm is positive 

and thus A{Cyy,A) > 0. □ 

Since the violation of the equality of the terms in (3) can be in both direc- 
tions, we propose to prefer the causal direction for which A is closer to zero. 

3 Inference algorithm and experiments 

Motivated by the above theoretical results, we propose to infer the causal di- 
rection using Alg. 1.^ 

In light of the theoretical results, the following issues have to be clarified by 
experiments with simulated data: 

1. Is the limit for dimension to infinity already justified for moderate dimen- 
sions? 

2. Is the multiplicativity of traces sufficiently violated for noisy models? 

Furthermore, the following issue has to be clarified by experiments with real 
data: 



^Please note that this algorithm, including the complete code to repro- 
duce the experiments reported in this paper, is available as R code at: 
http : //www . OS . helsinki . f i/u/phoyer/code/hdlin . tar . gz 
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Algorithm 1 Identifying linear causal relations via traces 
1: Input: {xi,yi),...,{xk,yk) 

2: Compute the estimators Cxx, Cxy, Cyx, Cyy 
3: Compute A := CyxC^x 
4: Compute A := CxyCyy 

5: if \logTm{ACxxA^) - logTniC'xx) logr„(AA^)| > e + 

I log r„(iCYyi^) - log r„(CFr) - logT„(ii^)| then 
6: write "y is the cause" 
7: else 

8: if I log T„(iCYyi^) - log T™(Cyy) - logT„(ii^)| > 6 + 

I logTmiACxxA^) - logTniCxx) - logr„(AA'^)| then 
9: write "X is the cause" 
10: else 

11: write "cause cannot be identified" 
12: end if 
13: end if 



3. Is the behaviour of real causal structures qualitatively sufficiently close to 
our model with independent choices of A and Cxx according to a uniform 

prior? 

For the simulated data, we have generated random models Y = AX + £^ as 
follows: We independently draw each element of the m x n structure matrix A 
from a standardized Gaussian distribution. This implies that the distribution 
of column vectors as well as the distribution of row vectors is isotropic. To 
generate a random covariance matrix Cxx, we similarly draw an n x n matrix 
B and set Cxx ■= BB^ . Due to the invariance of our decision rule with respect 
to the scaling of A and Cxx-, the structure matrix and the covariance can have 
the same scale without loss of generality. The covariance Cee oi the noise is 
generated in the same way, although with an adjustable parameter a governing 
the scaling of the noise with respect to the signal: cr = yields the deterministic 
setting, while a = 1 equates the power of the noise to that of the signal. 

First, we demonstrate the performance of the method in the close-to de- 
terministic setting (a = 0.05) as a function of the dimensionality n = to of 
the simulations, ranging from dimension 2 to 50. To show that the method is 
feasible even with a relatively small number of samples, we choose the number 
of samples N to scale with the dimension as iV = 2n. (Note that we must 
have A'' > min(n, m) to obtain invertible estimates of the covariance matrices.) 
The resulting proportion of correct vs wrong decisions is given in Fig. la, with 
the corresponding values of A in Fig. lb. As can be seen, even at as few as 5 
dimensions and 10 samples, the method is able to reliably identify the direction 
of causality in these simulations. 

To illustrate the degree to which identifiability is hampered by noise, the 
solid line in Fig. Ic gives the performance of the method for a fixed dimension 
(n = m = 10) and fixed sample size {N = 1000) as a function of the noise 
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Figure 1: Simulation results, (a) Performance of the method as a function of 
the input dimensionality n, when the output dimensionality m — n and the 
sample size is = 2n. The green curve denotes the fraction of simulations 
on which the true causal direction was selected, while the red curve gives the 
fraction of wrong answers, (b) Mean values of A corresponding to the true 
direction (green) vs the wrong direction (red), (c) Performance as a function 
of noise level cr, for dimensionality n = m = 10 and sample size N = 1000. To 
compare, the dashed lines give the performance based on the exact covariance 
matrices rather than based on the samples, (d) Mean values of A corresponding 
to the true direction (green) vs the wrong direction (red). See main text for 
discussion. 



level a. As can be seen, the performance drops markedly as a is increased. As 
soon as there is significantly more noise than signal (say, tr > 2), the number of 
samples is not sufficient to reliably estimate the required covariance matrices and 
hence the direction of causality. This is clear from looking at the much better 
performance of the method when based on the exact, true covariance matrices, 
given by the dashed lines. In Fig. Id we show the corresponding values of A, 
from which it is clear that the estimate based on the samples is quite biased for 
the forward direction. 

As experiments with real data with known ground truth, we have chosen 
16 X 16 pixel images of handwritten digits [14]. As the linear map A we have 
used both random local translation-invariant linear filters and also standard 
blurring of the images. (We added a small amount of noise to both original and 
processed images, to avoid problems with very close-to singular covariances.) 
See Fig. 2 for some example original and processed image pairs. The task is then: 
given a sample of pairs (xi,?/i) consisting of the picture Xj and its processed 
counterpart yj infer which of the set of pictures x or y are the originals ('causes'). 
By partitioning the image set by the digit class (0-9), and by testing a variety 
of random filters (and the standard blur), we obtained a number of test cases 
to run our algorithm on. Out of the total of 100 tested cases, the method was 
able to correctly identify the set of original images 94 times, with 4 unknowns 
(i.e. only two falsely classified cases). 

These simulations and experiments are quite preliminary and mainly serve 
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to illustrate the theory developed in the paper. They point out at least one 
important issue for future work: the construction of unbiased estimators for the 
trace values or the A. The systematic deviation of the sample-based experiments 
from the covariance-matrix based experiments in Fig. Ic-d suggest that this 
could be a major improvement. 

4 Outlook: generalizations of the method 

In this section, we want to rephrase our theoretical results in a more abstract way 
to show the general structure. We have rejected the causal hypothesis X — > F 
if we observe that Tn{ACxxA'^) attains values that are not typical among the 
set of transformed input covariance matrices UCxxU^ ■ In principle, we could 
have any function K that maps the output distribution P{Y) to some value 
K{P{Y)). Moreover, we could have any group G of transformations g on the 
input variable X that define transformed input distributions via 

P,{X) = Pig-'X) . 

Applying the conditional P{Y\X) to Pg{X) defines output distributions P^^\Y) 
that we compare to P{Y). In particular, we check whether the value K{P{Y)) 
is typical for the set K{P^9) (Y))g^a. 

Postulate 1 (distribution of effect is typical for the group orbit) 

Let X and Y be random variables with joint distribution P{X, Y) and G be a 
group of transformations of the value set of X . Let K{.) be some real-valued 
function on the probability distributions ofY. The causal hypothesis X ^ Y is 
unlikely if K{P{Y)) is smaller or greater than the big majority of all distribu- 
tions (p(s)(y))geG 

Our prior knowledge about the structure of the data set determines the ap- 
propriate choice of G. The idea is that G expresses a set of transformations that 
generate input distributions Pg{X) that we consider equally likely. The permu- 
tation of components of X also defines an interesting transformation group. For 
time series, the translation group would be the most natural choice. 

Interpreting this approach in a Bayesian way, we thus use symmetry prop- 
erties of priors without the need to explicitly define the priors themselves. 

5 Discussion 

Our experiments with simulated data suggest that the method performs quite 
well already for moderate dimensions provided that the noiselevel is not too 
high. Certainly, the model of drawing Cxx according to a distribution that 
is invariant under Cxx ^ UCxxU^ may be inappropriate for many practical 
applications. However, as the example with diagonal matrices in Section 1 
shows, the statement /S.{Cxx,A) = holds for a much broader class of models. 
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Figure 2: Example of original and filtered digit data, (a) Original digit im- 
age, (b &: c) Transformed image using two different random local translation- 
invariant linear filters, (d) Transformed image using a simple local blur filter. 



For this reason, the method could also be used as a sanity check for causal 
hypotheses among one-dimensional variables. Assume, for instance, one has a 
causal DAG G connecting 2n variables attaining values in K. If Xi,. . . ,X2n 
is an ordering that is consistent with G, we define Y := (Xi,...,X„) and 
W :~ {Xn+i, . . . , X2n) and check the hypothesis Y ^ W using our method. 
Provided that the true causal relations are linear, such a hypothesis should be 
accepted for every possible ordering that is consistent with the true causal DAG. 
This way one could, for instance, check the causal relation between genes by 
clustering their expression levels to vector- valued variables. 
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